Bike Sharing Operator

This project presents an in-depth analysis of a bike sharing operator based in London, UK. The primary goal of any bike sharing business is to ensure convenience, accessibility, and a seamless user experience — from the start of a trip to its finish — while maintaining profitability in response to increasing demand for sustainable urban mobility. Whether for commuting or leisure, users expect well-maintained, affordable bikes that support a reliable alternative to traditional transport.

To explore how the bike sharing business can optimize its resources and refine its strategic direction, this project first establishes key research questions aimed at supporting the marketing and operations teams in maximizing efficiency and identifying the most effective paths to success. Through extensive data wrangling and in-depth analysis, the resulting insights aim to provide valuable recommendations that help the operator enhance profitability, minimize unnecessary operational costs, and stay aligned with its mission to deliver the best possible experience for its users.

However, given the limitations of the available dataset, several assumptions and disclaimers must be acknowledged.
Disclaimer
The dataset only contains data from August and September 2017.
There were originally 826 stations, but 53 are missing from the station dataset. Hence, this analysis is conducted only on the stations present in the station dataset.
📋 Assumption Table
Assumption: Capacity is the number of bikes available at a station at the point in time this dataset captures.
Reasoning: The flow dataframe in Research Question 5 can then be used to track station inflows and outflows over time.

Assumption: Duration under 30 minutes = commuter; duration of 30 minutes or more = leisure rider.
Reasoning: Commuters tend to use bikes for last-mile travel and hence rarely ride longer than 30 minutes. Note, however, that some leisure riders ride for less than 30 minutes and some commuters ride for more.

Assumption: Each journey generates revenue and lasts 15 minutes, with pricing adopted from Lime.
Reasoning: This allows the cost of service disruption to be estimated despite the limitations of the available dataset.
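As a minimal sketch of the duration assumption, a hypothetical helper (not part of the analysis code) mirroring the 30-minute cutoff applied later:

```r
# Hypothetical helper: rides of 30 minutes or more are treated as leisure
# rides, shorter rides as commuter rides, per the assumption table.
classify_rider <- function(duration_m) {
  ifelse(duration_m >= 30, "Leisure", "Commuter")
}

classify_rider(c(12, 45))  # "Commuter" "Leisure"
```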

Dataset used for analysis 💾

While this analysis is based on real historical bike operation data, it is important to note that the dataset should not be considered a representative sample of the broader population. Due to missing and insufficient data, the findings may not accurately reflect the characteristics of the larger system. The following datasets were used in this project:

Dataset Fields
Stations Station.ID, Capacity, Station.Name, Location
Journeys Journey.Duration, Journey.ID, End.Date, End.Month, End.Year, End.Hour, End.Minute, End.Station.ID, Start.Date, Start.Month, Start.Year, Start.Hour, Start.Minute, Start.Station.ID

Research Questions

The following research questions were designed with the marketing and operations departments in mind. They serve as a framework to uncover key insights into the current business structure, identify operational gaps, and better understand the state of the bike sharing service. These insights will support more informed resource allocation and strategic decision-making within the respective departments.

Question 1: What is the overall distribution of users of the bike sharing service?
Question 2: Which stations experience the highest traffic on weekdays compared to weekends?
Question 3: Are there observable changes in ridership trends between August and September that may indicate seasonal effects?
Question 4: What are the peak usage periods throughout the day?
Question 5: Which critical stations require additional resource allocation based on usage intensity?
Question 6: How do riding patterns differ between leisure users and commuters across weekdays and weekends?
Question 7: When is the optimal time to schedule bike maintenance to minimize service disruption?
Question 8: What is the distribution of journey durations among users?
Question 9: What are the most common or popular routes taken by users?
Question 10: What is the estimated cost of service disruptions based on usage data?

1. Importing Relevant Libraries & Datasets

library(dplyr)
library(lubridate)
library(leaflet)
library(leaflet.extras)
library(geodata)
library(tidyr)
library(ggplot2)
library(gganimate)
library(plotly)
library(scales)

# Datasets
station <- read.csv("./Bike Sharing Dataset/stations.csv")
journey <- read.csv("./Bike Sharing Dataset/journeys.csv")

# To be used with Maps
UK <- gadm(country='GBR', level=3,path= tempdir())

head(journey,10)
##    Journey.Duration Journey.ID End.Date End.Month End.Year End.Hour End.Minute
## 1              1320       2351       14         9       17       14         53
## 2               960      11205       15         9       17       16          3
## 3               780      14816       15         9       17       17         38
## 4               720      11403       15         9       17       19         42
## 5              1320      13910       14         9       17        9         46
## 6              1140      10684       19         9       17        8         50
## 7               720       9977       15         9       17       19          8
## 8               780       4894       13         9       17       17         52
## 9              1200       2027       17         9       17       11         38
## 10              240       6584       18         9       17       14         27
##    End.Station.ID Start.Date Start.Month Start.Year Start.Hour Start.Minute
## 1             514         14           9         17         14           31
## 2             350         15           9         17         15           47
## 3             197         15           9         17         17           25
## 4             201         15           9         17         19           30
## 5             288         14           9         17          9           24
## 6             130         19           9         17          8           31
## 7             154         15           9         17         18           56
## 8             104         13           9         17         17           39
## 9             250         17           9         17         11           18
## 10            219         18           9         17         14           23
##    Start.Station.ID
## 1               589
## 2               396
## 3               298
## 4               224
## 5               581
## 6               273
## 7               509
## 8               135
## 9               616
## 10              405
head(station,10)
##    Station.ID Capacity                         Station.Name
## 1           1       19           River Street , Clerkenwell
## 2           2       37       Phillimore Gardens, Kensington
## 3           3       32 Christopher Street, Liverpool Street
## 4           4       23      St. Chad's Street, King's Cross
## 5           5       27        Sedding Street, Sloane Square
## 6           6       18       Broadcasting House, Marylebone
## 7           7       16    Charlbert Street, St. John's Wood
## 8           8       18          Lodge Road, St. John's Wood
## 9           9       19             New Globe Walk, Bankside
## 10         10       18                Park Street, Bankside
##                 Location
## 1   (51.529163,-0.10997)
## 2  (51.499606,-0.197574)
## 3  (51.521283,-0.084605)
## 4  (51.530059,-0.120973)
## 5   (51.49313,-0.156876)
## 6  (51.518117,-0.144228)
## 7    (51.5343,-0.168074)
## 8  (51.528341,-0.170134)
## 9   (51.507385,-0.09644)
## 10 (51.505974,-0.092754)

1.1 Data Manipulation, Cleaning & Sorting

This step prepares the dataset for analysis and ensures that missing values and anomalies are dealt with accordingly. Data types are corrected and additional columns are added for efficiency in the later analysis.

### Dealing with Station Dataset ###

# Locate all the stations ID that are missing from the station dataset
station <- station %>% arrange(Station.ID) #There are some stations ID missing

# Vector of all possible station IDs
all_ids <- 1:826

# Convert Station.ID to numeric in case it's not already
station$Station.ID <- as.numeric(station$Station.ID)

# Find missing IDs
missing_ids <- setdiff(all_ids, station$Station.ID)

# Split Coordinates into Lon and Lat
station <- separate(station,Location, into=c("lat","lng"), sep=",")
station$lat <- as.numeric(gsub("\\(","",station$lat))
station$lng <- as.numeric(gsub("\\)","",station$lng))


#### To ensure Stations in Journey Dataset tallies with Station Dataset ####

# Check whether any journey's start/end station is among the missing IDs (2382 rows)
journey_w_missingid <- journey[journey$Start.Station.ID %in% missing_ids | journey$End.Station.ID %in% missing_ids,] # All rows that reference a missing station ID
index <- rownames(journey_w_missingid)


# NEW JOURNEY DATASET WITHOUT ROWS THAT INCLUDE MISSING STATION IDS (151414 rows; original 153796 rows) #
journey_new <- journey[!(rownames(journey)%in% index),]

rownames(journey_new) <- NULL
journey_new <- journey_new %>% select(Journey.ID,Journey.Duration,Start.Station.ID,Start.Date,Start.Month,Start.Year,Start.Hour,Start.Minute,End.Station.ID,End.Date,End.Month,End.Year,End.Hour,End.Minute)

### Manipulating Journey Dataset ###

# Creating Journey Start Time Column
journey_new$Start.Time <- paste0(journey_new$Start.Year,"-",journey_new$Start.Month,"-",journey_new$Start.Date," ",journey_new$Start.Hour,":",journey_new$Start.Minute,":0")
journey_new$Start.Time <- ymd_hms(journey_new$Start.Time)

# Creating Journey End Time Column
journey_new$End.Time <- paste0(journey_new$End.Year,"-",journey_new$End.Month,"-",journey_new$End.Date," ",journey_new$End.Hour,":",journey_new$End.Minute,":0")
journey_new$End.Time <- ymd_hms(journey_new$End.Time)


# Creating Journey Duration
journey_new$Duration_s <- as.integer(difftime(journey_new$End.Time, journey_new$Start.Time, units = "secs"))
journey_new$Duration_m <- as.integer(difftime(journey_new$End.Time, journey_new$Start.Time, units = "mins"))

3. Data Analysis Based on Research Questions

The following section dives into the dataset to uncover meaningful insights that bring clarity to our research questions.

3.1 Research Question 1: What is the overall distribution of users of the bike sharing service?

# First plot heatmap of a map showing the Start.Station to locate all the stations distributed around London

plot1_df <- journey_new %>% group_by(Start.Station.ID) %>% summarise(Count = n())
plot1_df <- plot1_df %>% left_join(station, by=c("Start.Station.ID"="Station.ID"))

# The intensity scale gets skewed by a few high-count stations, making lower ones
# indistinguishable. Taking log1p makes the colour more sensitive to small differences
# and avoids domination by outliers.
plot1_df$logCount <- log1p(plot1_df$Count) 


## Plotting the Map
plot1 <- leaflet(plot1_df) %>% addProviderTiles(providers$CartoDB.DarkMatter) %>% addWebGLHeatmap(lng=~lng,lat=~lat, intensity= ~logCount, size = 30, opacity = 1) %>% setView(lng = mean(plot1_df$lng), lat = mean(plot1_df$lat), zoom = 12)
plot1

In the map above, the bike stations are denoted by red spots, clearly showing their distribution across London, with each station's popularity indicated by the colour gradient. Most stations are concentrated in the city centre, while the outskirts of London have fewer stations, reflecting where demand from users commuting to work or riding for leisure is highest.

3.2 Research Question 2: Which stations experience the highest traffic on weekdays compared to weekends?

# Adding a column that provides logical values to determine if the journey was on a weekday or weekend
journey_new$Is.Weekend <- wday(journey_new$Start.Time) %in% c(1, 7)

# Adding a column to extract the day with its name
journey_new$dow <- wday(journey_new$Start.Time,week_start=1,label=T,abbr=F)

# Preparing the dataframe for plotting
plot2_df <- journey_new %>% group_by(Start.Station.ID,dow,Is.Weekend) %>% summarise(Count = n()) %>% arrange(dow,Is.Weekend,Start.Station.ID)
plot2_df <- plot2_df %>% group_by(Start.Station.ID,Is.Weekend) %>% summarise(Avg.Demand = mean(Count)) %>% arrange(Start.Station.ID)
plot2_df <- plot2_df %>% left_join(station, by= c("Start.Station.ID" = "Station.ID")) %>% select(Station.Name,Start.Station.ID,Is.Weekend,Avg.Demand)

# Top 10 Stations for Weekdays and Weekends
plot2_weekday <- plot2_df %>% filter(Is.Weekend== FALSE) %>% arrange(desc(Avg.Demand)) %>% ungroup() %>% slice_head(n=10)
plot2_weekday <- plot2_weekday %>% ggplot(aes(x=reorder(Station.Name,Avg.Demand),y=Avg.Demand)) + geom_bar(stat='identity',fill="steelblue") +coord_flip() + xlab("Station Names") + ylab("Average Demand") + labs(title="Top 10 Hot Spot during Weekdays") + theme_minimal()


plot2_weekend <- plot2_df %>% filter(Is.Weekend==TRUE) %>% arrange(desc(Avg.Demand)) %>% ungroup() %>% slice_head(n=10)
plot2_weekend <- plot2_weekend %>% ggplot(aes(x=reorder(Station.Name,Avg.Demand),y=Avg.Demand)) + geom_bar(stat='identity',fill="#2E8B57") + coord_flip() + xlab("Station Names") + ylab("Average Demand") + labs(title="Top 10 Hot Spot during Weekends") + theme_minimal()

plot2_weekday

plot2_weekend

The two charts above show the top 10 most popular stations during weekdays and weekends. Diving deeper into the general character of those areas yields more valuable insights into the respective demands. During weekdays, bike demand shifts towards commuting and daily travel: users typically ride bikes for last-mile travel from major train stations or bus terminals to their offices. This is evident from King’s Cross, Waterloo and Liverpool Street, which are major transport hubs people pass through on the way to work. Business districts like Queen Street and The Borough also appear in the chart; people concentrate in these areas during the day, whether commuting to work, running errands, having a meal or coffee, or leaving work, making them hotspots for bike sharing services.

On weekends, users tend to ride bikes for leisure and recreational activities. Parks and scenic areas are attractive destinations for relaxing, exercising, or socializing, as evident from the increased appearance of park locations: Hyde Park (e.g., Hyde Park Corner, Triangle Car Park, Albert Gate, Park Lane), Kensington Gardens (e.g., Black Lion Gate, Palace Gate), Queen Elizabeth Olympic Park, Shoreditch, and Westminster (popular for casual weekend visits).

Based on these insights, the bike sharing operator can redistribute resources to anticipate demand in those areas, maximizing bike utilization and minimizing idle bikes, which optimizes resource allocation and saves costs.

3.4 Research Question 4: What are the peak usage periods throughout the day?

# Develop a timeslot function that buckets each time of day into 30-minute intervals
timeslot <- function(hour, minute) {
  ifelse(minute < 30,
         sprintf("%02d:00-%02d:30", hour, hour),
         sprintf("%02d:30-%02d:00", hour, ifelse(hour == 23, 0, hour + 1)))
}

# Use the timeslot function to bucket journeys into 30-minute timeslots and determine the general demand across a day

journey_new$timeslot <- timeslot(journey_new$Start.Hour,journey_new$Start.Minute)
journey_new$timeslot <- as.factor(journey_new$timeslot)
plot4_df <- journey_new %>% group_by(timeslot,dow) %>% summarise(Demand = n())

# Plotting the bar chart
plot4 <- plot4_df %>% ggplot(aes(x=Demand, y= timeslot)) + geom_bar(stat='identity',fill="#008080") + theme_minimal() + theme(axis.text.y = element_text(size=6.5)) + ylab("Timeslot in a Day") +labs(title="Demand Distribution throughout the day in 30 minutes interval")


# Extracting the Top 3 Demands from each day
top3_each_day <- plot4_df %>% group_by(dow) %>% arrange(desc(Demand)) %>% slice_head(n=3) %>% mutate(Rank = "top") %>% ungroup()

# Get the rest by excluding the top 3
rest <- anti_join(plot4_df, top3_each_day, by = c("dow", "timeslot", "Demand")) %>% mutate(Rank = "rest")

# Merge the dataset
plot4_df2 <- rbind(top3_each_day,rest)

# Plotting gganimate to show top 3 timeslot demands from each day
plot4_df2 <- plot4_df2 %>% arrange(timeslot,dow)
plot4_animate <- plot4_df2 %>% ggplot(aes(x=Demand,y=timeslot,fill=Rank)) + geom_bar(stat='identity') + theme_minimal()+ theme(axis.text.y = element_text(size=7)) + scale_fill_manual(values=c("darkgrey","tomato")) + transition_states(dow,transition_length = 2,state_length = 1) + labs(title="Day of Week: {closest_state} | Demand of Each Day") +ylab("Timeslot in a Day")


plot4

This plot takes all journeys from the journey dataset and separates them into timeslot buckets regardless of day or date, showing the general demand across a day and allowing us to identify peak usage periods. Demand for the bike sharing service generally peaks twice a day, especially between 17:30-18:00 and 08:30-09:00, likely because users ride to and from work in the bustling city of London. With this insight, the business can plan to distribute the necessary resources to support demand at different times, as well as schedule maintenance when demand is at its lowest.

plot4_animate

To give the business a more actionable analysis, the plot is further split by day of the week so that each day's demand can be studied separately, with the peak timeslots of each day highlighted. We can observe some slight differences between weekend and weekday demand: on weekends, most peak periods fall in the late afternoon to evening, suggesting users ride for leisure around the parks or the city, consistent with our earlier findings.

3.5 Research Question 5: Which critical stations require additional resource allocation based on usage intensity?

To answer this question, we first set up parameters to track historical departures and arrivals from the dataset. A dataframe is then created to track the inflow and outflow of bikes at each station, giving a general idea of each station's demand. We also assume that the capacity value is the current number of available bikes at the station so that we can observe the flow of bikes through each station.

# Preparing dataset
journeytest <- journey_new
journeytest$timeslotend <- timeslot(journeytest$End.Hour,journeytest$End.Minute)
journeytest$timeslotend <- as.factor(journeytest$timeslotend)
journeytest$Date <- date(journeytest$Start.Time)

station$Station.ID <- as.factor(station$Station.ID)


# Creating columns to track departures and arrivals of the stations
journeytest <- journeytest %>% select(Date,timeslot,Start.Station.ID,timeslotend,End.Station.ID)

departures <- journeytest %>% group_by(Start.Station.ID, timeslot,Date) %>% summarise(Outflow = n(), .groups="drop") %>% rename(Station.ID = Start.Station.ID,Timeslot=timeslot)

arrivals <- journeytest %>% group_by(End.Station.ID,timeslotend,Date) %>% summarise(Inflow = n(), .groups="drop") %>% rename(Station.ID=End.Station.ID,Timeslot=timeslotend)


# Prepare flow dataframe to show the inflow and outflow of bikes of each station

flow_df <- full_join(arrivals,departures, by =c("Station.ID","Timeslot","Date")) %>% mutate(Inflow = replace_na(Inflow,0),Outflow= replace_na(Outflow,0), Netflow = Inflow-Outflow, Station.ID = as.factor(Station.ID))
flow_df <- flow_df %>% arrange(Station.ID, Date, Timeslot)

flow_df <- flow_df %>% left_join(station,by="Station.ID")%>%group_by(Station.ID)%>% mutate(NewCapacity = Capacity + cumsum(Netflow))
flow_df <- flow_df %>% select(Date,Station.ID,Timeslot,Inflow,Outflow,Netflow,Capacity,NewCapacity) %>% arrange(Date,Station.ID,Timeslot)

flow_df <- flow_df %>% mutate(RemainingPct= NewCapacity/Capacity)
head(flow_df,5)
## # A tibble: 5 × 9
## # Groups:   Station.ID [2]
##   Date       Station.ID Timeslot    Inflow Outflow Netflow Capacity NewCapacity
##   <date>     <fct>      <fct>        <int>   <int>   <int>    <int>       <int>
## 1 2017-08-01 1          08:00-08:30      0       1      -1       19          18
## 2 2017-08-01 2          06:30-07:00      2       0       2       37          39
## 3 2017-08-01 2          07:00-07:30      0       1      -1       37          38
## 4 2017-08-01 2          08:30-09:00      1       0       1       37          39
## 5 2017-08-01 2          11:30-12:00      1       0       1       37          40
## # ℹ 1 more variable: RemainingPct <dbl>

From here, we flag risky stations using the RemainingPct parameter with a threshold of 10%: whenever NewCapacity drops to 10% or below of the original Capacity, the station is flagged. Next, we extract stations that have been flagged at least once into a new dataframe called risky_station, shown below.

From this dataframe, we then count how many times each station has fallen below the threshold. A high count indicates high demand and low supply, which is critical, as we want to prevent situations where a user wants to rent a bike but no bikes are available at the station.

# Flagging out Stations to monitor flow of supply and demand of risky stations
risky_station <- flow_df %>% filter(RemainingPct<0.1)
head(risky_station,5)
## # A tibble: 5 × 9
## # Groups:   Station.ID [1]
##   Date       Station.ID Timeslot    Inflow Outflow Netflow Capacity NewCapacity
##   <date>     <fct>      <fct>        <int>   <int>   <int>    <int>       <int>
## 1 2017-08-03 154        08:00-08:30      0       6      -6       35           2
## 2 2017-08-03 154        08:30-09:00      1      11     -10       35          -8
## 3 2017-08-03 154        16:00-16:30      1       0       1       35          -7
## 4 2017-08-03 154        16:30-17:00      4       0       4       35          -3
## 5 2017-08-04 154        08:30-09:00      0       7      -7       35          -3
## # ℹ 1 more variable: RemainingPct <dbl>
riskydf <- risky_station %>% group_by(Station.ID) %>% summarise(TimesFlagged=n(),TotalNetOutflow = sum(-Netflow[Netflow<0]),.groups="drop") %>% arrange(desc(TimesFlagged)) %>% slice_head(n=10)

plot5 <- riskydf %>% ggplot(aes(x = TimesFlagged, y = Station.ID)) + geom_bar(stat='identity',fill="#1E90FF") + xlab("Number of Times Flagged") + ylab("Station ID") + labs(title="Top 10 Most Flagged Stations - High Risk") +theme_minimal()
plot5

According to the plot, these are the stations flagged most often for falling below the threshold. Further analysis needs to be conducted to determine the demand these stations face so that suitable recommendations can be provided to the business operators.

# Taking the Flagged Station IDs
flagged_stations <- riskydf$Station.ID

# Filtering out only the top 10 flagged stations
plot5_altdf <- journey_new[journey_new$Start.Station.ID %in% flagged_stations,] 

plot5_altdf <- plot5_altdf %>% mutate(Date = date(Start.Time),Station.ID = as.factor(Start.Station.ID))
plot5_alt <- plot5_altdf %>% group_by(Station.ID,Date) %>% summarise(Station.Demands = n()) %>% ggplot(aes(x=Station.ID,y=Station.Demands)) + geom_bar(stat='identity', fill="#6A5ACD") + ylab("Demands of the Flagged Stations") + xlab("Station's IDs") + labs(title="Demand of the Top 10 Flagged Stations") + theme_minimal()
                                                                                                                                                       
plot5_alt

With this information, bike sharing operators can reallocate resources to meet this demand, as these stations are likely situated at popular transport hubs, intersections, or densely populated areas where bikes are used frequently. The exact locations of these stations are identified in a later plot, as they relate to users' most popular routes as well.

plot5_alt + transition_time(Date)+ labs(title="Date: {frame_time} | Daily Demand of the Top 10 Flagged Stations")

This visualisation shows how demand changes on a daily basis across August and September 2017. From the animation, stations 154 and 248 can be observed to have some of the highest demand throughout the period compared with the other eight stations.

3.6 Research Question 6: How do riding patterns differ between leisure users and commuters across weekdays and weekends?

# Ensure Original Dataset is not modified to preserve data integrity
plot6_df <- journey_new

# Leisure riders: duration of 30 minutes or more; otherwise commuters - logical value
plot6_df$Is.Leisure <- plot6_df$Duration_m >= 30

# Preparing dataset to track leisure and commuter riders on Weekdays
plot6_weekday <- plot6_df %>% filter(Is.Weekend == FALSE) %>% group_by(Is.Leisure, timeslot)  %>% summarise(Demand = n())

plot6_weekday <- plot6_weekday %>% ggplot(aes(x=timeslot,y=Demand,color=Is.Leisure, group=Is.Leisure)) + geom_line()+ theme_minimal() + theme(axis.text.x = element_text(angle = 90)) + labs(title="Leisure Vs Commuter Riders on a Weekday") + xlab("Timeslot in a Day")



# Preparing dataset to track leisure and commuter riders on Weekends
plot6_weekend <- plot6_df %>% filter(Is.Weekend == TRUE) %>% group_by(Is.Leisure, timeslot)  %>% summarise(Demand = n())

plot6_weekend <- plot6_weekend %>% ggplot(aes(x=timeslot,y=Demand,color=Is.Leisure, group=Is.Leisure)) + geom_line()+theme_minimal() + theme(axis.text.x = element_text(angle = 90)) + labs(title="Leisure Vs Commuter Riders on a Weekend") +xlab("Timeslot in a Day") 

plot6_weekday

Since no parameters are available to determine riders’ intentions, an assumption was made in order to uncover insights for this research question: riders who rode for 30 minutes or more are treated as leisure riders, while those who rode for less than 30 minutes are treated as commuters. Commuters tend to use bikes for last-mile travel, so their durations are usually shorter than those of leisure riders.

From this graph, we can observe the demand distribution of leisure riders versus commuters across the day in 30-minute buckets. Leisure riding is generally on the low side during weekdays, as most people use the bike sharing service to get to school or work.

Disclaimer: It is also important to note that there are instances where Leisure Riders ride for less than 30 minutes and Commuters ride for more than 30 minutes

plot6_weekend

In the weekend chart, a noticeable difference can be observed between the leisure and commuter patterns: longer-duration rides increase significantly while shorter-duration rides drop sharply. With such patterns present, it is important for bike operators to cater to these differences in rider type so that supply can anticipate demand on both weekdays and weekends.

3.7 Research Question 7: When is the optimal time to schedule bike maintenance to minimize service disruption?

# Preparation of dataset
plot7_df <- journey_new %>% group_by(timeslot,Is.Weekend) %>% summarise(Demand = n())

# Find minimum demand row for weekday
min_weekday <- plot7_df %>% filter(Is.Weekend == FALSE) %>% ungroup() %>% filter(Demand == min(Demand)) 

plot7_weekday <- plot7_df %>% filter(Is.Weekend == FALSE) %>% ggplot(aes(x=Demand, y= timeslot)) + geom_bar(stat='identity',fill="steelblue") + theme_minimal() + theme(axis.text.y = element_text(size=6.5))
plot7_weekday <- plot7_weekday + annotate("segment",
           x = min_weekday$Demand + 600, xend = min_weekday$Demand,
           y = min_weekday$timeslot, yend = min_weekday$timeslot,
           colour = "red", arrow = arrow(type = "closed", length = unit(0.15, "inches")),size=0.05) +
  annotate("text",
           x = min_weekday$Demand + 10,
           y = min_weekday$timeslot,
           label = "Best Maintenance Period - Lowest Traffic", colour = "red", size = 3, hjust = -0.3) + ylab("Timeslot in a Day") + labs(title="Optimal Time to Schedule Bike Maintenance on Weekdays")


# Find minimum demand row for weekend
min_weekend <- plot7_df %>% filter(Is.Weekend == TRUE) %>% ungroup() %>% filter(Demand == min(Demand))

plot7_weekend <- plot7_df %>% filter(Is.Weekend == TRUE) %>% ggplot(aes(x=Demand, y= timeslot)) + geom_bar(stat='identity',fill="steelblue")+ theme_minimal() + theme(axis.text.y = element_text(size=6.5))
plot7_weekend <- plot7_weekend + annotate("segment",
           x = min_weekend$Demand + 150, xend = min_weekend$Demand,
           y = min_weekend$timeslot, yend = min_weekend$timeslot,
           colour = "red", arrow = arrow(type = "closed", length = unit(0.15, "inches")),size=0.05) +
  annotate("text",
           x = min_weekend$Demand + 10,
           y = min_weekend$timeslot,
           label = "Best Maintenance Period - Lowest Traffic", colour = "red", size = 3, hjust = -0.3)+ ylab("Timeslot in a Day") + labs(title="Optimal Time to Schedule Bike Maintenance on Weekends")

plot7_weekday

To locate the best maintenance period, we identify the lowest-traffic timeslot of the day so as to minimize disruption to users. Per the indicator, the best time for maintenance on a weekday is between 03:30 and 04:30.

plot7_weekend

As for the weekend, the optimal time to schedule bike maintenance would be between 05:30 - 06:00.

3.8 Research Question 8: What is the distribution of journey durations among users?

# Preparing Dataset

plot8_df <- journey_new %>% filter(Duration_m <= 100)


# Extracting Mean & SD to prepare for normal distribution
mean_duration <- mean(plot8_df$Duration_m, na.rm = TRUE)
sd_duration <- sd(plot8_df$Duration_m, na.rm = TRUE)

ggplot(plot8_df, aes(x = Duration_m)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "white") +
  geom_vline(aes(xintercept = mean_duration), color = "red", linetype = "dashed", size = 1) +
  geom_vline(aes(xintercept = mean_duration + sd_duration), color = "orange", linetype = "dotted", size = 1) +
  geom_vline(aes(xintercept = mean_duration - sd_duration), color = "orange", linetype = "dotted", size = 1) +
  labs(title = "Distribution of Ride Durations",
       x = "Duration (minutes)", y = "Count") +
  annotate("text", x = mean_duration, y = Inf, label = "Mean", vjust = -0.5, color = "red") +
  annotate("text", x = mean_duration + sd_duration, y = Inf, label = "+1 SD", vjust = -0.5, color = "orange") +
  annotate("text", x = mean_duration - sd_duration, y = Inf, label = "-1 SD", vjust = -0.5, color = "orange") +
  theme_minimal() +
  scale_x_continuous(breaks = seq(0, 100, by = 5))

By binning durations in minutes, we can plot the distribution of ride durations across all journeys users took with the bike sharing operator. The red line marks the mean and the dotted lines mark one standard deviation on either side. Most users rode for around 15 minutes, with counts decreasing significantly as duration increases.

3.10 Research Question 10: What is the estimated cost of service disruptions based on usage data?

Before approaching this research question, and given the lack of available information, we use Lime as a benchmark for bike sharing pricing. Additionally, based on the analysis above, we assume each ride typically lasts ~15 minutes.

According to Lime, users pay a flat unlock fee of £1 plus £0.20 per minute of riding. At the assumed 15 minutes per ride, each recorded journey therefore generates £1 + 15 × £0.20 = £4 in revenue. When a timeslot is taken offline for servicing, the demand in that slot is counted as lost revenue, giving the estimated cost of the service disruption.
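To make the revenue assumption concrete, a minimal sketch of the per-ride calculation (the function name and the alternative ride lengths are ours, included only to show sensitivity; the rates follow the Lime benchmark above):

```r
# Hedged revenue model: £1 unlock fee + £0.20 per minute (Lime benchmark)
revenue_per_ride <- function(minutes, unlock_fee = 1.00, per_minute = 0.20) {
  unlock_fee + per_minute * minutes
}

# Revenue at 10-, 15- and 20-minute rides, to show how sensitive
# the £4 estimate is to the assumed ride length
revenue_per_ride(c(10, 15, 20))
## [1] 3 4 5
```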

Disclaimer: Various other fixed and variable costs are involved in operating a bike sharing service; however, given the limited dataset, this revenue-based assumption serves only as a basic reference point for estimating the cost of service disruptions.

# Estimated cost of disruption = demand per timeslot x £4 revenue per ride

plot10_df <- journey_new %>%
  group_by(timeslot) %>%
  summarise(Demand = n()) %>%
  mutate(Cost.of.Disruption = Demand * 4,
         CostLabel = scales::label_currency(prefix = "£", big.mark = ",")(Cost.of.Disruption))
plot10_df
## # A tibble: 48 × 4
##    timeslot    Demand Cost.of.Disruption CostLabel
##    <fct>        <int>              <dbl> <chr>    
##  1 00:00-00:30    785               3140 £3,140   
##  2 00:30-01:00    688               2752 £2,752   
##  3 01:00-01:30    572               2288 £2,288   
##  4 01:30-02:00    412               1648 £1,648   
##  5 02:00-02:30    355               1420 £1,420   
##  6 02:30-03:00    288               1152 £1,152   
##  7 03:00-03:30    248                992 £992     
##  8 03:30-04:00    194                776 £776     
##  9 04:00-04:30    171                684 £684     
## 10 04:30-05:00    169                676 £676     
## # ℹ 38 more rows
plot10 <- plot10_df %>%
  ggplot(aes(x = timeslot, y = Cost.of.Disruption, fill = Cost.of.Disruption)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90)) +
  # Format the y-axis in pounds (plain `dollar` would prefix "$")
  scale_y_continuous(labels = scales::label_currency(prefix = "£", big.mark = ",")) +
  labs(title = "Estimated Cost of Disruption per Timeslot", y = "Cost (£)", x = "Timeslot") +
  scale_fill_gradient(low = "#00FF7F", high = "#FF4500")

plot10

In the chart above, cost is encoded by the colour gradient of the bars: green bars mark the timeslots where a fault in stations, bikes, or the application software would cost the least, while red bars mark the most expensive windows. This insight helps bike sharing operators prioritise stronger fail-safe measures and more robust systems to deter downtime during peak periods, avoiding large disruption costs and maintaining a satisfactory experience for users regardless of the day or time.
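The same idea can be taken one step further by ranking the costliest windows explicitly. A minimal sketch, run here on an illustrative stand-in for plot10_df (the timeslot costs below are made up, not taken from the dataset):

```r
library(dplyr)

# Illustrative stand-in for the plot10_df built above (values are invented)
demo_cost <- tibble(
  timeslot = c("00:00-00:30", "08:00-08:30", "17:30-18:00"),
  Cost.of.Disruption = c(3140, 9800, 12400)
)

# slice_max() keeps the rows with the highest disruption cost,
# i.e. the timeslots where fail-safe measures matter most
demo_cost %>% slice_max(Cost.of.Disruption, n = 2)
```

Run on the real plot10_df, this would hand the operations team a short list of timeslots to protect first.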

4. Conclusion & Recommendations

Based on the analysis conducted, several strategic actions are recommended to enhance service quality, ensure operational efficiency, and support long-term growth and scalability for the bike-sharing operator.

As a key player in urban mobility, the company’s core mission is to deliver convenient, accessible, high-quality, and affordable bike rental services across the dynamic landscape of London. Given the city’s scale and infrastructure, bike-sharing provides a valuable alternative mode of transport, especially for short-distance travel.

To uphold service standards and user satisfaction, it is essential to minimize service downtime through regular bike maintenance, particularly during off-peak hours. Proactive planning should also be in place to manage anticipated surges in demand, ensuring that both bicycles and docking points are optimally available.

From a scalability perspective, certain stations flagged with high demand pose a potential risk of bicycle shortages. It is recommended that the company monitors these high-traffic stations closely and considers strategic redistribution of bikes or the installation of additional stations in underserved yet high-demand areas.

In a competitive market, maintaining leadership requires continuous responsiveness to user pain points and a commitment to resolving issues promptly. This not only enhances user trust but also supports customer retention and brand differentiation.

While many operators are diversifying their fleets and adapting fare structures to meet shifting demand patterns, it is crucial to remain aligned with the company’s original mission: to offer a service that is accessible, reliable, affordable, and user-centric. Staying true to this vision will be instrumental in sustaining the company’s relevance, resilience, and long-term success in the evolving mobility landscape.

5. References

MetMatters. (2017, October 10). UK weather review: September 2017. RMetS. https://www.rmets.org/metmatters/uk-weather-review-september-2017

Guardian News and Media. (2017, September 1). Britain’s summer 2017 was wetter but also warmer than average. The Guardian. https://www.theguardian.com/uk-news/2017/sep/01/britains-summer-2017-was-wetter-but-also-warmer-than-average

Cyclist. (2025, January 30). Best bike share and rental bicycles in London: Lime vs. Santander and alternatives. https://www.cyclist.co.uk/reviews/ridden-and-rated-ultimate-guide-to-london-s-bike-sharing-rental-bicycles

Analysis Done By: Patrick Tan